Examining the Quality of White Wine

The following analysis investigates factors affecting the quality of white wine, focussing on 11 variables that quantify the chemical properties of wine and an overall quality score.

The details of each variable are available in the attached text file (citations.txt).

Overview of Data

After loading in the data, I’ll take a look at the variables within the dataframe:

## [1] 4898   13
##  [1] "X"                    "fixed.acidity"        "volatile.acidity"    
##  [4] "citric.acid"          "residual.sugar"       "chlorides"           
##  [7] "free.sulfur.dioxide"  "total.sulfur.dioxide" "density"             
## [10] "pH"                   "sulphates"            "alcohol"             
## [13] "quality"
## 'data.frame':    4898 obs. of  13 variables:
##  $ X                   : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ fixed.acidity       : num  7 6.3 8.1 7.2 7.2 8.1 6.2 7 6.3 8.1 ...
##  $ volatile.acidity    : num  0.27 0.3 0.28 0.23 0.23 0.28 0.32 0.27 0.3 0.22 ...
##  $ citric.acid         : num  0.36 0.34 0.4 0.32 0.32 0.4 0.16 0.36 0.34 0.43 ...
##  $ residual.sugar      : num  20.7 1.6 6.9 8.5 8.5 6.9 7 20.7 1.6 1.5 ...
##  $ chlorides           : num  0.045 0.049 0.05 0.058 0.058 0.05 0.045 0.045 0.049 0.044 ...
##  $ free.sulfur.dioxide : num  45 14 30 47 47 30 30 45 14 28 ...
##  $ total.sulfur.dioxide: num  170 132 97 186 186 97 136 170 132 129 ...
##  $ density             : num  1.001 0.994 0.995 0.996 0.996 ...
##  $ pH                  : num  3 3.3 3.26 3.19 3.19 3.26 3.18 3 3.3 3.22 ...
##  $ sulphates           : num  0.45 0.49 0.44 0.4 0.4 0.44 0.47 0.45 0.49 0.45 ...
##  $ alcohol             : num  8.8 9.5 10.1 9.9 9.9 10.1 9.6 8.8 9.5 11 ...
##  $ quality             : int  6 6 6 6 6 6 6 6 6 6 ...
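The output above has the shape produced by `dim()`, `names()` and `str()`. The report does not show the loading code, so the following is only a sketch of how the data could be read and inspected. It uses the first five rows exactly as printed by `str()` (density appears rounded there) in place of the real file, whose name is not given in the report.

```r
# First five rows as printed by str() above; the real CSV file name is not
# stated in the report, so inline text stands in for it here.
csv_lines <- c(
  paste("X,fixed.acidity,volatile.acidity,citric.acid,residual.sugar,chlorides,",
        "free.sulfur.dioxide,total.sulfur.dioxide,density,pH,sulphates,alcohol,quality",
        sep = ""),
  "1,7,0.27,0.36,20.7,0.045,45,170,1.001,3,0.45,8.8,6",
  "2,6.3,0.3,0.34,1.6,0.049,14,132,0.994,3.3,0.49,9.5,6",
  "3,8.1,0.28,0.4,6.9,0.05,30,97,0.995,3.26,0.44,10.1,6",
  "4,7.2,0.23,0.32,8.5,0.058,47,186,0.996,3.19,0.4,9.9,6",
  "5,7.2,0.23,0.32,8.5,0.058,47,186,0.996,3.19,0.4,9.9,6"
)

# read.csv() accepts literal text through its `text` argument.
whitewine <- read.csv(text = paste(csv_lines, collapse = "\n"))

dim(whitewine)    # rows and columns
names(whitewine)  # column names
str(whitewine)    # type and first values of each column
```

The `X` column is just a row index, which is why the dataframe has 13 columns for 11 chemical properties plus quality.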

I will also take a look at a summary of the data:

##        X        fixed.acidity    volatile.acidity  citric.acid    
##  Min.   :   1   Min.   : 3.800   Min.   :0.0800   Min.   :0.0000  
##  1st Qu.:1225   1st Qu.: 6.300   1st Qu.:0.2100   1st Qu.:0.2700  
##  Median :2450   Median : 6.800   Median :0.2600   Median :0.3200  
##  Mean   :2450   Mean   : 6.855   Mean   :0.2782   Mean   :0.3342  
##  3rd Qu.:3674   3rd Qu.: 7.300   3rd Qu.:0.3200   3rd Qu.:0.3900  
##  Max.   :4898   Max.   :14.200   Max.   :1.1000   Max.   :1.6600  
##  residual.sugar     chlorides       free.sulfur.dioxide
##  Min.   : 0.600   Min.   :0.00900   Min.   :  2.00     
##  1st Qu.: 1.700   1st Qu.:0.03600   1st Qu.: 23.00     
##  Median : 5.200   Median :0.04300   Median : 34.00     
##  Mean   : 6.391   Mean   :0.04577   Mean   : 35.31     
##  3rd Qu.: 9.900   3rd Qu.:0.05000   3rd Qu.: 46.00     
##  Max.   :65.800   Max.   :0.34600   Max.   :289.00     
##  total.sulfur.dioxide    density             pH          sulphates     
##  Min.   :  9.0        Min.   :0.9871   Min.   :2.720   Min.   :0.2200  
##  1st Qu.:108.0        1st Qu.:0.9917   1st Qu.:3.090   1st Qu.:0.4100  
##  Median :134.0        Median :0.9937   Median :3.180   Median :0.4700  
##  Mean   :138.4        Mean   :0.9940   Mean   :3.188   Mean   :0.4898  
##  3rd Qu.:167.0        3rd Qu.:0.9961   3rd Qu.:3.280   3rd Qu.:0.5500  
##  Max.   :440.0        Max.   :1.0390   Max.   :3.820   Max.   :1.0800  
##     alcohol         quality     
##  Min.   : 8.00   Min.   :3.000  
##  1st Qu.: 9.50   1st Qu.:5.000  
##  Median :10.40   Median :6.000  
##  Mean   :10.51   Mean   :5.878  
##  3rd Qu.:11.40   3rd Qu.:6.000  
##  Max.   :14.20   Max.   :9.000
  1. Looking at the quality of the wine in the data set, most of the wine has been given a quality score of 6: the median and 3rd quartile are both 6, and the mean is 5.878. No wine was scored below 3 or above 9.
  2. The mean alcohol content is 10.51%, with a median of 10.4%, a minimum value of 8% and a maximum value of 14.2%.
  3. The pH levels tend to be a little above 3, with a mean of 3.188, a median of 3.18 and a range of 2.72 - 3.82.
  4. Residual sugar ranges from 0.6 - 65.8. The maximum value seems to be an outlier, with the 3rd quartile level being 9.9 and a mean of 6.391.
  5. Free sulfur dioxide levels range from 2 - 289, with a mean of 35. Here the maximum value also looks to be an outlier.
  6. The mean level of fixed acidity is 6.855, with a range of 3.8 - 14.2.
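One way to make the outlier remarks above concrete is Tukey's 1.5 x IQR rule, applied to the quartiles printed in the summary. This is a common convention rather than something the report itself states, and the helper name `upper_fence` is hypothetical.

```r
# Upper fence of Tukey's rule: Q3 + k * IQR (k = 1.5 by convention).
upper_fence <- function(q1, q3, k = 1.5) q3 + k * (q3 - q1)

# Quartiles taken from the summary output above.
sugar_fence <- upper_fence(q1 = 1.7, q3 = 9.9)  # residual.sugar
so2_fence   <- upper_fence(q1 = 23,  q3 = 46)   # free.sulfur.dioxide

sugar_fence         # 22.2
so2_fence           # 80.5
65.8 > sugar_fence  # TRUE: the maximum residual sugar lies beyond the fence
289  > so2_fence    # TRUE: the maximum free sulfur dioxide lies beyond the fence
```

Under this rule both maxima sit far beyond the fence, which is consistent with treating them as outliers.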

Univariate Plots Section

Quality

Since I know from the summary data that no wine has been scored below 3 or above 9 for quality, I set the limits for the histogram accordingly. The mean (red line) and the first and third quartiles (red dashes) have been included, showing that most wines have been scored a 5 or a 6. The median score is also 6.

Fixed Acidity

## 
##  3.8  3.9  4.2  4.4  4.5  4.6  4.7  4.8  4.9    5  5.1  5.2  5.3  5.4  5.5 
##    1    1    2    3    1    1    5    9    7   24   23   28   27   28   31 
##  5.6  5.7  5.8  5.9    6  6.1 6.15  6.2  6.3  6.4 6.45  6.5  6.6  6.7  6.8 
##   71   88  121  103  184  155    2  192  188  280    1  225  290  236  308 
##  6.9    7  7.1 7.15  7.2  7.3  7.4  7.5  7.6  7.7  7.8  7.9    8  8.1  8.2 
##  241  232  200    2  206  178  194  123  153   93   93   74   80   56   56 
##  8.3  8.4  8.5  8.6  8.7  8.8  8.9    9  9.1  9.2  9.3  9.4  9.5  9.6  9.7 
##   52   35   32   25   15   18   16   17    6   21    3   11    2    5    4 
##  9.8  9.9   10 10.2 10.3 10.7 11.8 14.2 
##    8    2    3    1    2    2    1    1

Most wines have between 4g/L-10g/L of fixed acidity. The quartile lines show that the distribution of the data is symmetrical, similar to a normal distribution.
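The frequency table above holds enough information to rebuild the fixed acidity values and redraw a histogram with mean and quartile lines. The sketch below uses base R graphics. That is an assumption, since the report does not show which plotting package was used, and the variable names are illustrative.

```r
# Values and counts copied from the frequency table above.
values <- c(3.8, 3.9, 4.2, 4.4, 4.5, 4.6, 4.7, 4.8, 4.9, 5, 5.1, 5.2, 5.3, 5.4, 5.5,
            5.6, 5.7, 5.8, 5.9, 6, 6.1, 6.15, 6.2, 6.3, 6.4, 6.45, 6.5, 6.6, 6.7, 6.8,
            6.9, 7, 7.1, 7.15, 7.2, 7.3, 7.4, 7.5, 7.6, 7.7, 7.8, 7.9, 8, 8.1, 8.2,
            8.3, 8.4, 8.5, 8.6, 8.7, 8.8, 8.9, 9, 9.1, 9.2, 9.3, 9.4, 9.5, 9.6, 9.7,
            9.8, 9.9, 10, 10.2, 10.3, 10.7, 11.8, 14.2)
counts <- c(1, 1, 2, 3, 1, 1, 5, 9, 7, 24, 23, 28, 27, 28, 31,
            71, 88, 121, 103, 184, 155, 2, 192, 188, 280, 1, 225, 290, 236, 308,
            241, 232, 200, 2, 206, 178, 194, 123, 153, 93, 93, 74, 80, 56, 56,
            52, 35, 32, 25, 15, 18, 16, 17, 6, 21, 3, 11, 2, 5, 4,
            8, 2, 3, 1, 2, 2, 1, 1)

# Expand the table back into one value per wine.
fixed_acidity <- rep(values, times = counts)

# Quartiles and median for the reference lines.
q <- quantile(fixed_acidity, c(0.25, 0.5, 0.75))

hist(fixed_acidity, breaks = seq(3.5, 14.5, by = 0.25),
     main = "Fixed acidity", xlab = "Fixed acidity (g/L)")
abline(v = mean(fixed_acidity), col = "red")    # mean: solid line
abline(v = q[c(1, 3)], col = "red", lty = 2)    # quartiles: dashed lines
```

The quartiles (6.3 and 7.3) sit almost equally far from the median (6.8), which is what the symmetry remark relies on.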

Volatile Acidity

Most wines have volatile acidity between 0.1-0.7 although there are a number of outliers. The mean and median are both between 0.2 and 0.3, with the first and third quartiles around 0.22 and 0.32 respectively.

Citric Acid

Most wines have a citric acid level of less than 0.75g/L. The data looks to be fairly symmetrical, with the mean and median almost equal.

Residual Sugar

The data is positively skewed.

## 
##   0.6   0.7   0.8   0.9  0.95     1  1.05   1.1  1.15   1.2  1.25   1.3 
##     2     7    25    39     4    93     1   146     3   187     3   147 
##  1.35   1.4  1.45   1.5  1.55   1.6  1.65   1.7  1.75   1.8  1.85   1.9 
##     2   184     4   142     2   165     2    99     1    99     3    59 
##  1.95     2  2.05   2.1   2.2  2.25   2.3  2.35   2.4   2.5   2.6  2.65 
##     2    79     1    51    56     2    42     1    41    40    33     1 
##   2.7   2.8  2.85   2.9     3   3.1  3.15   3.2   3.3   3.4   3.5   3.6 
##    38    36     1    25    17    17     1    28    23    13    31    22 
##   3.7  3.75   3.8  3.85   3.9  3.95     4   4.1   4.2  4.25   4.3  4.35 
##    12     2    21     3    17     3    19    17    31     2    19     1 
##   4.4  4.45   4.5  4.55   4.6   4.7  4.75   4.8  4.85   4.9     5   5.1 
##    14     3    33     2    40    29     5    38     1    35    43    28 
##  5.15   5.2  5.25   5.3  5.35   5.4  5.45   5.5  5.55   5.6   5.7   5.8 
##     2    29     4    17     2    23     2    13     1    16    30    23 
##  5.85   5.9  5.95     6   6.1   6.2   6.3  6.35   6.4   6.5  6.55   6.6 
##     2    19     1    23    21    31    39     1    34    26     1    30 
##  6.65   6.7  6.75   6.8  6.85   6.9  6.95     7  7.05   7.1   7.2  7.25 
##     3    25     1    28     6    20     1    31     2    36    29     2 
##   7.3  7.35   7.4  7.45   7.5   7.6   7.7  7.75   7.8  7.85   7.9  7.95 
##    19     2    40     1    30    29    34     2    41     1    32     1 
##     8   8.1  8.15   8.2  8.25   8.3   8.4  8.45   8.5  8.55   8.6  8.65 
##    32    34     1    36     2    31    13     1    24     1    27     1 
##   8.7  8.75   8.8   8.9  8.95     9  9.05   9.1  9.15   9.2  9.25   9.3 
##    18     2    22    23     1    18     1    17     2    22     2    11 
##   9.4   9.5  9.55   9.6  9.65   9.7   9.8  9.85   9.9    10 10.05  10.1 
##    10     9     1    18     4    22    16     3    18    18     3    14 
##  10.2  10.3  10.4  10.5 10.55  10.6 10.65  10.7  10.8  10.9    11  11.1 
##    23    16    25    16     1    22     1    26    17    11    19    18 
##  11.2 11.25  11.3  11.4 11.45  11.5  11.6  11.7 11.75  11.8  11.9 11.95 
##    18     2    12    14     1    11    15     8     4    35    16     3 
##    12 12.05  12.1 12.15  12.2  12.3  12.4  12.5 12.55  12.6  12.7 12.75 
##    16     1    21     4    15    13    19    16     2    16    16     1 
##  12.8 12.85  12.9    13  13.1 13.15  13.2  13.3  13.4  13.5 13.55  13.6 
##    25     4    25    19    23     1    13    16     7    10     3    12 
## 13.65  13.7  13.8  13.9    14 14.05  14.1 14.15  14.2  14.3 14.35  14.4 
##     4    21     8    18    16     1     4     1    20    17     3    17 
## 14.45  14.5 14.55  14.6  14.7 14.75  14.8  14.9 14.95    15  15.1 15.15 
##     3    17     3    13    14     2    12    14     2    13     7     1 
##  15.2 15.25  15.3  15.4  15.5 15.55  15.6  15.7 15.75  15.8  15.9    16 
##     6     1     9    17    11     6    14     9     1     6     2    10 
## 16.05  16.1  16.2  16.3  16.4 16.45  16.5 16.55  16.6 16.65  16.7 16.75 
##     6     2     7     7     5     1     3     1     2     5     5     2 
##  16.8 16.85  16.9 16.95    17 17.05  17.1  17.2  17.3 17.35  17.4 17.45 
##     4     4     3     3     1     1     5     9    14     1     2     2 
##  17.5 17.55  17.6  17.7 17.75  17.8 17.85  17.9 17.95    18 18.05  18.1 
##     8     3     2     1     4    13     5     2     3     2     3     6 
## 18.15  18.2  18.3 18.35  18.4  18.5  18.6 18.75  18.8  18.9 18.95  19.1 
##     8     3     2     4     1     1     1     4     3     1     3     1 
## 19.25  19.3 19.35  19.4 19.45  19.5  19.6  19.8  19.9 19.95 20.15  20.2 
##     3     4     1     2     3     2     1     4     1     3     1     2 
##  20.3  20.4  20.7  20.8    22  22.6  23.5 26.05  31.6  65.8 
##     1     1     2     2     2     1     1     2     2     1

After taking the log10 of the data, the distribution looks bimodal, with one peak at around 0.25 and another at around 1 on the log10 scale (roughly 1.8g/L and 10g/L).
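Because the smallest residual sugar value in the table is 0.6g/L, peak positions of 0.25 and 1 are most plausibly positions on the log10 axis rather than raw concentrations. Assuming that reading, the short sketch below converts them back to g/L.

```r
# Peak positions read from a log10-scaled axis (assumed interpretation).
log10_peaks <- c(0.25, 1)

# Undo the log10 transform to get concentrations in g/L.
peaks_g_per_l <- 10^log10_peaks
round(peaks_g_per_l, 2)  # 1.78 10.00
```

On this reading the two modes sit near 1.8g/L and 10g/L, both of which fall inside the observed range of 0.6g/L to 65.8g/L.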

Chlorides

There are also a number of outliers in the data for chlorides.

## 
## 0.009 0.012 0.013 0.014 0.015 0.016 0.017 0.018 0.019  0.02 0.021 0.022 
##     1     1     1     4     4     5     5    10     9    16    19    19 
## 0.023 0.024 0.025 0.026 0.027 0.028 0.029  0.03 0.031 0.032 0.033 0.034 
##    20    34    30    54    58    85    81   108   107   109   119   168 
## 0.035 0.036 0.037 0.038 0.039  0.04 0.041 0.042 0.043 0.044 0.045 0.046 
##   130   200   160   167   157   182   147   184   141   201   170   181 
## 0.047 0.048 0.049  0.05 0.051 0.052 0.053 0.054 0.055 0.056 0.057 0.058 
##   171   174   133   170   115   104   130    99    61    88    68    53 
## 0.059  0.06 0.061 0.062 0.063 0.064 0.065 0.066 0.067 0.068 0.069  0.07 
##    36    46    19    25    23    15     8    18    18     7    18     6 
## 0.071 0.072 0.073 0.074 0.075 0.076 0.077 0.078 0.079  0.08 0.081 0.082 
##     5     2     5     8     2     9     1     2     4     4     2     2 
## 0.083 0.084 0.085 0.086 0.087 0.088 0.089  0.09 0.091 0.092 0.093 0.094 
##     5     5     3     4     3     2     1     2     1     3     3     5 
## 0.095 0.096 0.097 0.098 0.099 0.102 0.104 0.105 0.108  0.11 0.112 0.114 
##     2     6     1     3     1     1     1     1     2     3     1     1 
## 0.115 0.117 0.118 0.119  0.12 0.121 0.122 0.123 0.126 0.127  0.13 0.132 
##     1     3     1     3     1     2     1     4     3     2     1     1 
## 0.133 0.135 0.136 0.137 0.138 0.142 0.144 0.145 0.146 0.147 0.148 0.149 
##     1     1     1     2     2     3     1     1     1     2     1     1 
##  0.15 0.152 0.154 0.156 0.157 0.158  0.16 0.167 0.168 0.169  0.17 0.171 
##     1     2     1     1     4     1     2     2     3     2     2     1 
## 0.172 0.173 0.174 0.175 0.176 0.179  0.18 0.184 0.185 0.186 0.194 0.197 
##     2     2     2     2     2     1     1     2     2     1     1     2 
##   0.2 0.201 0.204 0.208 0.209 0.211 0.212 0.217 0.239  0.24 0.244 0.255 
##     1     2     1     2     1     1     1     1     1     1     1     1 
## 0.271  0.29 0.301 0.346 
##     1     1     1     1

Most wines have a chloride level of between 0.03g/L-0.06g/L.

Free Sulfur Dioxide

## 
##     2     3     4     5     6     7     8     9    10    11  11.5    12 
##     1    10    11    25    32    25    35    29    55    45     1    51 
##    13    14    15  15.5    16    17    18    19  19.5    20    21    22 
##    55    68    79     1    58    89    80    84     1   101    93   102 
##    23  23.5    24    25    26    27    28  28.5    29    30  30.5    31 
##   110     1   118   111   129    99   112     1   160    99     1   132 
##    32    33    34    35  35.5    36    37    38  38.5    39  39.5    40 
##   109   112   128   129     2   127   111   102     1    89     1   103 
##  40.5    41  41.5    42  42.5    43  43.5    44  44.5    45    46    47 
##     1   104     2    86     1    63     1    75     4   101    64    91 
##    48  48.5    49    50  50.5    51  51.5    52  52.5    53    54    55 
##    66     7    82    64     2    54     1    72     4    68    61    58 
##    56    57    58    59  59.5    60  60.5    61  61.5    62    63    64 
##    42    44    37    39     2    38     2    47     1    29    30    23 
##  64.5    65    66    67    68    69    70  70.5    71    72    73  73.5 
##     1    14    17    22    24    17    11     1     5     6     8     4 
##    74    75    76    77  77.5    78    79  79.5    80    81    82  82.5 
##     5     7     5     5     1     4     2     4     1     7     2     1 
##    83    85    86    87    88    89    93    95    96    97    98   101 
##     4     2     2     4     1     1     1     1     3     1     3     2 
##   105   108   110   112 118.5 122.5   124   128   131 138.5 146.5   289 
##     2     3     1     1     1     1     1     1     1     1     1     1

Changing the bin width and excluding the outlier gives the following graph:

The typical range for free sulfur dioxide in wine is between 10mg/L-70mg/L.

Total Sulfur Dioxide

In general, total sulfur dioxide ranges between 75mg/L-225mg/L, with a few outliers.

Density

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.9871  0.9917  0.9937  0.9940  0.9961  1.0390

Density varies very little, with most of the wine falling within a narrow range between 0.989 and 0.996.

pH

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   2.720   3.090   3.180   3.188   3.280   3.820

The pH level for the wine is typically between 3 and 3.5. Looking at the quartiles and the median, the data looks to be fairly symmetrical with the mean pH level 3.188 and the median 3.18.

Sulphates

The levels of sulphates are slightly positively skewed. For most wines, the range is between 0.3g/L and 0.6g/L.

Alcohol

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    8.00    9.50   10.40   10.51   11.40   14.20

## 
##                8              8.4              8.5              8.6 
##                2                3                9               23 
##              8.7              8.8              8.9                9 
##               78              107               95              185 
##              9.1              9.2              9.3              9.4 
##              144              199              134              229 
##              9.5 9.53333333333333             9.55              9.6 
##              228                3                2              128 
## 9.63333333333333              9.7 9.73333333333333             9.75 
##                1              105                2                1 
##              9.8              9.9               10 10.0333333333333 
##              136              109              162                1 
##             10.1 10.1333333333333            10.15             10.2 
##              114                2                3              130 
##             10.3             10.4 10.4666666666667             10.5 
##               85              153                2              160 
## 10.5333333333333            10.55 10.5666666666667             10.6 
##                1                2                1              114 
##            10.65             10.7             10.8             10.9 
##                1               96              135               88 
## 10.9333333333333 10.9666666666667            10.98               11 
##                2                3                1              158 
##            11.05 11.0666666666667             11.1             11.2 
##                2                1               83              112 
## 11.2666666666667             11.3 11.3333333333333            11.35 
##                1              101                3                1 
## 11.3666666666667             11.4 11.4333333333333            11.45 
##                1              121                1                4 
## 11.4666666666667             11.5            11.55             11.6 
##                1               88                1               46 
## 11.6333333333333            11.65             11.7 11.7333333333333 
##                2                1               58                1 
##            11.75             11.8            11.85             11.9 
##                2               60                1               53 
##            11.94            11.95               12            12.05 
##                2                1              102                1 
## 12.0666666666667             12.1            12.15             12.2 
##                1               51                2               86 
##            12.25             12.3 12.3333333333333             12.4 
##                1               62                1               68 
##             12.5             12.6             12.7            12.75 
##               83               63               56                3 
##             12.8 12.8933333333333             12.9               13 
##               54                2               39               36 
##            13.05             13.1 13.1333333333333             13.2 
##                1               18                1               14 
##             13.3             13.4             13.5            13.55 
##                7               20               12                1 
##             13.6             13.7             13.8             13.9 
##                9                7                2                3 
##               14            14.05             14.2 
##                5                1                1

The median alcohol percentage is 10.4% and the mean is 10.5%. For most of the wine, the alcohol level is between 9% and 12%.

Total Acidity

While volatile acidity should be kept as low as possible, overall acidity is important in wine: too much can lead to a sour tasting wine, and too little will result in a flat taste.

Therefore, I have decided to look at total acidity in the wine.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   4.130   6.890   7.405   7.467   7.960  14.960

The plot looks fairly symmetrical, similar to a normal distribution, although there are a few outliers.

Acidity/Sugar Ratio

Since acidity can counterbalance sweetness in a wine, I would like to look at this ratio.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.1423  0.7768  1.3840  2.5330  4.2560 15.4800

Univariate Analysis

Structure of Dataset

There are 4898 different wines in the dataset, and 12 variables have been examined:

- fixed acidity
- volatile acidity
- citric acid
- residual sugar
- chlorides
- free sulfur dioxide
- total sulfur dioxide
- density
- pH
- sulphates
- alcohol
- quality

All variables are floating point numbers with the exception of quality which is an integer. The wine in the dataset has been given a quality rating from 0 (worst) to 10 (best).

Other observations:

  1. The spread of the quality of the white wine resembles a normal distribution with a peak at 6.
  2. Acidity (fixed and volatile), sulfur dioxide levels (total and free) and pH levels also display characteristics of a normal distribution.
  3. Alcohol content seems to be fairly evenly distributed between the 9%-12% level, with the exception of around 10%, which seems to have a higher frequency.
  4. Residual sugar levels are relatively low for much of the sample: the most frequent values lie below 2g/L, although the median is 5.2g/L.
  5. There is a wine with a residual sugar level of 65.8g/L, which seems to be an outlier.
  6. There are also outliers with high levels of chlorides, acidity (volatile and fixed) and free sulfur dioxide.
  7. Both fixed acidity and volatile acidity have clear peaks.

New Variables

I included a total acidity variable, adding together fixed acidity and citric acid with volatile acidity. While levels of volatile acidity should be kept as low as possible to avoid fermentation, the balance of overall acidity could determine whether a particular wine is flat tasting (acidity too low), or sour (acidity too high).

I have also looked at the ratio of total acidity to residual sugar. Perhaps there is an optimum level?
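The two derived variables can be sketched as follows, using the first three rows printed by `str()` earlier in the report. The column names `total.acidity` and `aciditySugar.ratio` are the ones that appear in the correlation output; the exact code used in the report is not shown, so this is one plausible way to compute them.

```r
# First three wines, values as printed by str() earlier in the report.
whitewine <- data.frame(
  fixed.acidity    = c(7, 6.3, 8.1),
  volatile.acidity = c(0.27, 0.30, 0.28),
  citric.acid      = c(0.36, 0.34, 0.40),
  residual.sugar   = c(20.7, 1.6, 6.9)
)

# Total acidity: fixed + volatile + citric acid.
whitewine$total.acidity <- with(whitewine,
                                fixed.acidity + volatile.acidity + citric.acid)

# Ratio of total acidity to residual sugar.
whitewine$aciditySugar.ratio <- whitewine$total.acidity / whitewine$residual.sugar

print(whitewine)
```

As a consistency check on the definition, the reported means add up: 6.855 + 0.2782 + 0.3342 = 7.4674, which agrees with the mean total acidity of 7.467 shown in the univariate section.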

Unusual Distributions & Adjustments to Data

I log transformed the positively skewed residual sugar distribution. This resulted in a bimodal graph with peaks at around 0.25 and 1 on the log10 scale (roughly 1.8g/L and 10g/L).

Features of Dataset

The main feature of this dataset is quality. I will be examining factors that affect the quality of wine.

In order to determine which particular variables I will look at more closely, I would first like to examine correlation data.

Correlation

##                                 X fixed.acidity volatile.acidity
## X                     1.000000000   -0.25581431      0.002857966
## fixed.acidity        -0.255814305    1.00000000     -0.022697290
## volatile.acidity      0.002857966   -0.02269729      1.000000000
## citric.acid          -0.149899918    0.28918070     -0.149471811
## residual.sugar        0.006623775    0.08902070      0.064286060
## chlorides            -0.045645192    0.02308564      0.070511571
## free.sulfur.dioxide  -0.011928911   -0.04939586     -0.097011939
## total.sulfur.dioxide -0.161979037    0.09106976      0.089260504
## density              -0.185976097    0.26533101      0.027113845
## pH                   -0.115774132   -0.42585829     -0.031915368
## sulphates             0.009807759   -0.01714299     -0.035728147
## alcohol               0.213656245   -0.12088112      0.067717943
## quality               0.035763247   -0.11366283     -0.194722969
## total.acidity        -0.263216664    0.98717874      0.071570617
## aciditySugar.ratio   -0.062868153    0.11208572     -0.105308868
##                       citric.acid residual.sugar   chlorides
## X                    -0.149899918    0.006623775 -0.04564519
## fixed.acidity         0.289180698    0.089020701  0.02308564
## volatile.acidity     -0.149471811    0.064286060  0.07051157
## citric.acid           1.000000000    0.094211624  0.11436445
## residual.sugar        0.094211624    1.000000000  0.08868454
## chlorides             0.114364448    0.088684536  1.00000000
## free.sulfur.dioxide   0.094077221    0.299098354  0.10139235
## total.sulfur.dioxide  0.121130798    0.401439311  0.19891030
## density               0.149502571    0.838966455  0.25721132
## pH                   -0.163748211   -0.194133454 -0.09043946
## sulphates             0.062330940   -0.026664366  0.01676288
## alcohol              -0.075728730   -0.450631222 -0.36018871
## quality              -0.009209091   -0.097576829 -0.20993441
## total.acidity         0.394143356    0.104737493  0.04552987
## aciditySugar.ratio    0.022546906   -0.764289501 -0.04688074
##                      free.sulfur.dioxide total.sulfur.dioxide     density
## X                          -0.0119289106         -0.161979037 -0.18597610
## fixed.acidity              -0.0493958591          0.091069756  0.26533101
## volatile.acidity           -0.0970119393          0.089260504  0.02711385
## citric.acid                 0.0940772210          0.121130798  0.14950257
## residual.sugar              0.2990983537          0.401439311  0.83896645
## chlorides                   0.1013923521          0.198910300  0.25721132
## free.sulfur.dioxide         1.0000000000          0.615500965  0.29421041
## total.sulfur.dioxide        0.6155009650          1.000000000  0.52988132
## density                     0.2942104109          0.529881324  1.00000000
## pH                         -0.0006177961          0.002320972 -0.09359149
## sulphates                   0.0592172458          0.134562367  0.07449315
## alcohol                    -0.2501039415         -0.448892102 -0.78013762
## quality                     0.0081580671         -0.174737218 -0.30712331
## total.acidity              -0.0451333172          0.113188502  0.27560881
## aciditySugar.ratio         -0.2888136640         -0.369038808 -0.57164873
##                                 pH     sulphates     alcohol      quality
## X                    -0.1157741316  0.0098077589  0.21365624  0.035763247
## fixed.acidity        -0.4258582910 -0.0171429850 -0.12088112 -0.113662831
## volatile.acidity     -0.0319153683 -0.0357281469  0.06771794 -0.194722969
## citric.acid          -0.1637482114  0.0623309403 -0.07572873 -0.009209091
## residual.sugar       -0.1941334540 -0.0266643659 -0.45063122 -0.097576829
## chlorides            -0.0904394560  0.0167628837 -0.36018871 -0.209934411
## free.sulfur.dioxide  -0.0006177961  0.0592172458 -0.25010394  0.008158067
## total.sulfur.dioxide  0.0023209718  0.1345623669 -0.44889210 -0.174737218
## density              -0.0935914935  0.0744931485 -0.78013762 -0.307123313
## pH                    1.0000000000  0.1559514973  0.12143210  0.099427246
## sulphates             0.1559514973  1.0000000000 -0.01743277  0.053677877
## alcohol               0.1214320987 -0.0174327719  1.00000000  0.435574715
## quality               0.0994272457  0.0536778771  0.43557472  1.000000000
## total.acidity        -0.4306513315 -0.0118522486 -0.11751272 -0.131377207
## aciditySugar.ratio    0.0548562103 -0.0008598056  0.25832261 -0.013264024
##                      total.acidity aciditySugar.ratio
## X                      -0.26321666      -0.0628681526
## fixed.acidity           0.98717874       0.1120857243
## volatile.acidity        0.07157062      -0.1053088681
## citric.acid             0.39414336       0.0225469064
## residual.sugar          0.10473749      -0.7642895014
## chlorides               0.04552987      -0.0468807358
## free.sulfur.dioxide    -0.04513332      -0.2888136640
## total.sulfur.dioxide    0.11318850      -0.3690388081
## density                 0.27560881      -0.5716487284
## pH                     -0.43065133       0.0548562103
## sulphates              -0.01185225      -0.0008598056
## alcohol                -0.11751272       0.2583226123
## quality                -0.13137721      -0.0132640236
## total.acidity           1.00000000       0.0976389289
## aciditySugar.ratio      0.09763893       1.0000000000

Since quality correlates most strongly with alcohol (0.44), density (-0.31), chlorides (-0.21) and volatile acidity (-0.19), I will initially be focussing on these variables.
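The choice of variables can be checked by ranking the `quality` column of the correlation matrix by absolute value. The sketch below copies that column from the output above, rounded to four decimal places. With the full dataframe the same column would come from something like `cor(whitewine)[, "quality"]`, which is an assumption about how the matrix was produced.

```r
# Correlation of each variable with quality, copied from the matrix above
# (rounded to 4 decimals; the X index column and quality itself are omitted).
quality_cor <- c(
  fixed.acidity = -0.1137, volatile.acidity = -0.1947, citric.acid = -0.0092,
  residual.sugar = -0.0976, chlorides = -0.2099, free.sulfur.dioxide = 0.0082,
  total.sulfur.dioxide = -0.1747, density = -0.3071, pH = 0.0994,
  sulphates = 0.0537, alcohol = 0.4356, total.acidity = -0.1314,
  aciditySugar.ratio = -0.0133
)

# Rank by strength of association, ignoring the sign.
ranked <- quality_cor[order(abs(quality_cor), decreasing = TRUE)]
head(ranked, 4)
```

The four strongest entries are alcohol, density, chlorides and volatile acidity, in that order. Total sulfur dioxide (-0.17) follows closely behind volatile acidity.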

Bivariate Plots Section

Alcohol

Adjusting the plot slightly:

It may be more useful to look at boxplots and a summary for each quality rating.

## whitewine$quality: 3
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    8.00    9.55   10.45   10.34   11.00   12.60 
## -------------------------------------------------------- 
## whitewine$quality: 4
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    8.40    9.40   10.10   10.15   10.75   13.50 
## -------------------------------------------------------- 
## whitewine$quality: 5
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   8.000   9.200   9.500   9.809  10.300  13.600 
## -------------------------------------------------------- 
## whitewine$quality: 6
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    8.50    9.60   10.50   10.58   11.40   14.00 
## -------------------------------------------------------- 
## whitewine$quality: 7
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    8.60   10.60   11.40   11.37   12.30   14.20 
## -------------------------------------------------------- 
## whitewine$quality: 8
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    8.50   11.00   12.00   11.64   12.60   14.00 
## -------------------------------------------------------- 
## whitewine$quality: 9
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   10.40   12.40   12.50   12.18   12.70   12.90

There does seem to be a pattern, particularly among wines with a quality score of 5 or higher: the median alcohol content rises steadily from 9.5% at a score of 5 to 12.5% at a score of 9.
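The per-score tables above have the layout that `by(whitewine$alcohol, whitewine$quality, summary)` prints, although the report does not show the call. To make the pattern explicit, the sketch below takes the median alcohol content for each score from those tables and checks where it starts rising. The variable names are illustrative.

```r
# Median alcohol (%) for each quality score, copied from the tables above.
median_alcohol <- c("3" = 10.45, "4" = 10.10, "5" = 9.50, "6" = 10.50,
                    "7" = 11.40, "8" = 12.00, "9" = 12.50)

# Change in median alcohol from one score to the next.
diff(median_alcohol)

# From a score of 5 upwards, every step is an increase.
from_five <- median_alcohol[as.character(5:9)]
all(diff(from_five) > 0)  # TRUE
```

The medians fall from a score of 3 to a score of 5 and rise at every step after that, so the positive relationship only holds from 5 upwards.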

Density

Overlaying with a scatterplot, and excluding the outliers:

Density seems to decrease as quality increases.

Two further relationships I would like to look at in more detail are density and residual sugar levels and density and alcohol levels.

This shows that in general, the higher the residual sugar level is, the more dense the wine is.

This shows that the higher the alcohol percentage is, the less dense the wine is.

## whitewine$quality: 3
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.9911  0.9925  0.9944  0.9949  0.9969  1.0000 
## -------------------------------------------------------- 
## whitewine$quality: 4
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.9892  0.9926  0.9941  0.9943  0.9958  1.0000 
## -------------------------------------------------------- 
## whitewine$quality: 5
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.9872  0.9933  0.9953  0.9953  0.9972  1.0020 
## -------------------------------------------------------- 
## whitewine$quality: 6
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.9876  0.9917  0.9937  0.9940  0.9959  1.0390 
## -------------------------------------------------------- 
## whitewine$quality: 7
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.9871  0.9906  0.9918  0.9925  0.9937  1.0000 
## -------------------------------------------------------- 
## whitewine$quality: 8
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.9871  0.9903  0.9916  0.9922  0.9935  1.0010 
## -------------------------------------------------------- 
## whitewine$quality: 9
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.9896  0.9898  0.9903  0.9915  0.9906  0.9970

This shows an inverse relationship between density and wine quality.

Volatile Acidity

## whitewine$quality: 3
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.1700  0.2375  0.2600  0.3332  0.4125  0.6400 
## -------------------------------------------------------- 
## whitewine$quality: 4
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.1100  0.2700  0.3200  0.3812  0.4600  1.1000 
## -------------------------------------------------------- 
## whitewine$quality: 5
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.100   0.240   0.280   0.302   0.340   0.905 
## -------------------------------------------------------- 
## whitewine$quality: 6
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0800  0.2000  0.2500  0.2606  0.3000  0.9650 
## -------------------------------------------------------- 
## whitewine$quality: 7
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0800  0.1900  0.2500  0.2628  0.3200  0.7600 
## -------------------------------------------------------- 
## whitewine$quality: 8
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.1200  0.2000  0.2600  0.2774  0.3300  0.6600 
## -------------------------------------------------------- 
## whitewine$quality: 9
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.240   0.260   0.270   0.298   0.360   0.360

The line graph of mean volatile acidity for each quality score and the boxplots show a non-linear pattern. For wines with a quality score of 6 or more, the mean volatile acidity actually increases with quality.
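The turning point in that non-linear pattern can be located directly from the per-score means printed above. This is a small sketch with illustrative variable names, not the code behind the original line graph.

```r
# Mean volatile acidity for each quality score, copied from the tables above.
mean_va <- c("3" = 0.3332, "4" = 0.3812, "5" = 0.3020, "6" = 0.2606,
             "7" = 0.2628, "8" = 0.2774, "9" = 0.2980)

# Score with the lowest mean volatile acidity.
lowest <- names(which.min(mean_va))
lowest  # "6"

# From that score upwards the mean rises at every step.
all(diff(mean_va[as.character(6:9)]) > 0)  # TRUE
```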

Chlorides

The above graph looks like it could suggest the lower the chloride level, the higher the quality. Looking at box plots for each quality rating, we should be able to get more detail:

## whitewine$quality: 3
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.02200 0.03625 0.04100 0.05430 0.05400 0.24400 
## -------------------------------------------------------- 
## whitewine$quality: 4
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0130  0.0380  0.0460  0.0501  0.0540  0.2900 
## -------------------------------------------------------- 
## whitewine$quality: 5
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.00900 0.04000 0.04700 0.05155 0.05300 0.34600 
## -------------------------------------------------------- 
## whitewine$quality: 6
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.01500 0.03600 0.04300 0.04522 0.04900 0.25500 
## -------------------------------------------------------- 
## whitewine$quality: 7
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.01200 0.03100 0.03700 0.03819 0.04400 0.13500 
## -------------------------------------------------------- 
## whitewine$quality: 8
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.01400 0.03000 0.03600 0.03831 0.04400 0.12100 
## -------------------------------------------------------- 
## whitewine$quality: 9
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0180  0.0210  0.0310  0.0274  0.0320  0.0350

The boxplot above shows that, among the better wines (scores of 5 and above), chloride levels decrease as quality increases.

Other Variables

In general, it looks as though the lower the acidity, the better the wine. However, wine with a quality rating of 9 goes against the general pattern and has increased levels of acidity: total, citric and fixed.

This shows no clear pattern.

For most quality scores, the free sulfur dioxide levels remain steady. The notable exceptions are 3, 4 and 9.

The boxplot above shows that, in general, lower levels of total sulfur dioxide are associated with higher quality wines.

Higher pH seems to relate to a higher quality wine.

Bivariate Analysis

Relationships Observed

Overall, due to the fact that a lot of wines have been given a score of 6, the scatterplots suffered from overplotting.

Quality correlates with alcohol, with the mean and median alcohol percentage increasing with the quality scores.

Density appears to be negatively correlated with quality, with both the mean and the median density decreasing as the quality score increases.

Volatile acidity appears to have a non-linear relationship with quality: between quality scores of 4 and 6 the volatile acidity decreases as quality increases, and between 6 and 9 it increases as quality increases.

Similarly, in terms of chlorides, wine with a quality score of 5 or above has decreasing levels of chlorides as quality increases.

From the bivariate plots, it looks like the mean values for better wines, i.e. those with a score of 5 or above, display different characteristics from those with lower scores.

Additionally I looked at density in further detail, looking at its relationship with alcohol and with residual sugar. Density increases with residual sugar levels and decreases with alcohol.

The strongest relationships I found in the data were with alcohol and with density in relation to the quality of wine. Chlorides and pH also seem to relate to the quality score.

Multivariate Plots

At this point I’d like to cut the data into ‘bad’, ‘good’ and ‘best’ buckets according to their quality ratings: bad: 3-4, good: 5-6, best: 7-9.

This should make the multivariate plots easier to read.
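The bucketing described above can be sketched with base R's `cut()`. The variable name `quality.bucket` is the one mentioned later in this report; the exact breaks and labels are an assumption derived from the ranges listed above.

```r
# One example score per quality level present in the data.
quality <- c(3, 4, 5, 6, 7, 8, 9)

# cut() uses right-closed intervals by default: (2,4], (4,6], (6,9].
quality.bucket <- cut(
  quality,
  breaks = c(2, 4, 6, 9),
  labels = c("bad", "good", "best")
)

# Cross-tabulate scores against buckets to confirm the mapping.
print(table(quality, quality.bucket))
```

On the full data frame the same call would take `whitewine$quality` as its first argument, and the resulting factor could be mapped to colour in the multivariate plots.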

The above plot highlights the positive correlation between density and residual sugar. The higher quality wines appear to be less dense for a given residual sugar level compared with the lower quality wines.

Looking at the ratio of density to residual sugar, there is a loose positive correlation between this ratio and better quality wines. I will now look at this ratio against alcohol.

There does seem to be a vague pattern in the data: the good wines seem to have a higher alcohol level and a lower density/residual sugar ratio.

I should be able to find a stronger pattern in the data.

I will now take a closer look at the relationship between density and alcohol in relation to quality.

As already discussed, the higher quality wines tend to be less dense but I want to take a look at alcohol in more detail in relation to quality:

Better wines tend to have a higher alcohol level.

Looking at the density/alcohol ratio more closely, we get the following boxplots:

For the wines with a score of 5 or more, this shows there is a strong relationship between the density/alcohol ratio and the quality score, with few outliers.

I would also like to look at the relationship between chlorides and the density/alcohol ratio in relation to quality:

The above graph indicates the better wines have lower levels of chlorides and a lower density/alcohol ratio.

Intuitively, I would think that pH and acidity would be related and would have an effect on the quality of wine.

Looking at the scatterplot, there does not seem to be a significant relationship between pH, volatile acidity and quality. The density plot shows most of the wine fits into a narrow range of volatile acidity levels. However, the pH density plot shows a vague pattern of increasing pH leading to better wine quality.

Since I know from the previous section that pH and quality do seem to be correlated, I will now look at its relationship with alcohol levels and with density:

The boxplots show that there is a negative relationship between this ratio (pH/density/alcohol) and wine quality.
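As a small sketch of how the pH/density/alcohol ratio is formed, the snippet below applies it to the first three rows shown in the data overview at the top of this report. The density values are as printed there (rounded), and the variable names are illustrative.

```r
# First three observations as printed in the data overview (density is rounded there).
pH      <- c(3.00, 3.30, 3.26)
density <- c(1.001, 0.994, 0.995)
alcohol <- c(8.8, 9.5, 10.1)

# Division is left-associative, so this is pH divided by (density * alcohol).
ratio <- pH / density / alcohol
same_as_product <- isTRUE(all.equal(ratio, pH / (density * alcohol)))

print(round(ratio, 4))
print(same_as_product)
```

Because density varies so little across wines, most of the movement in this ratio comes from pH and alcohol.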

I will now look at the density/alcohol ratio against other variables:

Density/alcohol vs fixed acidity:

Density/alcohol vs citric acid:

Density/alcohol vs free sulfur dioxide:

Density/alcohol vs total sulfur dioxide:

Density/alcohol vs sulphates:

The above scatterplots show the good wines tend to have lower levels of sulphates and of total and free sulfur dioxide. However, among the ok wines, there seems to be no clear pattern. The boxplots also show no clear pattern across the different qualities of wine.

## 
## Calls:
## m1: lm(formula = quality ~ I(pH/density/alcohol), data = whitewine)
## m2: lm(formula = quality ~ I(pH/density/alcohol) + chlorides, data = whitewine)
## m3: lm(formula = quality ~ I(pH/density/alcohol) + chlorides + I(log10(residual.sugar)), 
##     data = whitewine)
## 
## =============================================================
##                                m1         m2         m3      
## -------------------------------------------------------------
##   (Intercept)                8.805***   8.746***   8.807***  
##                             (0.104)    (0.103)    (0.104)    
##   I(pH/density/alcohol)     -9.477***  -8.671***  -9.147***  
##                             (0.334)    (0.349)    (0.367)    
##   chlorides                            -4.151***  -4.087***  
##                                        (0.562)    (0.561)    
##   I(log10(residual.sugar))                         0.128***  
##                                                   (0.031)    
## -------------------------------------------------------------
##   R-squared                      0.1        0.2        0.2   
##   adj. R-squared                 0.1        0.2        0.2   
##   sigma                          0.8        0.8        0.8   
##   F                            806.9      435.2      296.9   
##   p                              0.0        0.0        0.0   
##   Log-likelihood             -5981.0    -5953.8    -5945.1   
##   Deviance                    3297.5     3261.2     3249.6   
##   AIC                        11968.0    11915.7    11900.2   
##   BIC                        11987.5    11941.7    11932.7   
##   N                           4898       4898       4898     
## =============================================================

The variables in the linear model only account for about 20% of the variation in quality scores (R-squared of 0.2).
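The table reports R-squared to one decimal place only. As a rough cross-check, it can be recovered from the F statistic and the sample size using the standard identity for a linear model with an intercept, R² = kF / (kF + n − k − 1), where k is the number of predictors. The sketch below applies this to the values printed in the table; the variable names are illustrative.

```r
# Values copied from the model table above.
n      <- 4898
k      <- c(m1 = 1, m2 = 2, m3 = 3)              # predictors per model
f_stat <- c(m1 = 806.9, m2 = 435.2, m3 = 296.9)  # reported F statistics

# Residual degrees of freedom for a model with an intercept.
df_resid <- n - k - 1

# Invert F = (R^2 / k) / ((1 - R^2) / df_resid) to solve for R^2.
r_squared <- (f_stat * k) / (f_stat * k + df_resid)

print(round(r_squared, 3))
```

On these figures the three models explain roughly 14% to 15% of the variance, which is consistent with the 0.1 and 0.2 shown after rounding.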

Multivariate Analysis

Relationships Observed

The relationship between alcohol and density was a relatively strong one in terms of determining quality, and it was further strengthened by including pH. A low pH/density/alcohol ratio appeared to be associated with better quality wine. However, this relationship was not strong enough to build an adequate linear model.

Interesting or Surprising Interactions

After researching measures of wine quality on the internet, I expected to find pH (the ‘backbone of wine’), acidity and residual sugar to be significant factors in determining the quality of wine. Intuitively this seems to make sense since these aspects are more easily detectable. However my analysis found that density and alcohol level played a big role in determining the quality of wine.

Further, I expected free and/or total sulfur dioxide levels to also play a significant role in determining the quality of wine, since larger levels should be easily detectable. However, this did not appear to be the case.

Models

I tried to create a linear model of quality against the pH/density/alcohol ratio, but it did not fit the data well: even with chlorides and residual sugar added, the model only accounted for about 20% of the variance in the quality score.

I did create a new variable ‘quality.bucket’ in order to group the data by ranges of quality, making it easier to analyse the plots.

Final Plots Summary

Plot One

## whitewine$quality: 3
## [1] 20
## -------------------------------------------------------- 
## whitewine$quality: 4
## [1] 163
## -------------------------------------------------------- 
## whitewine$quality: 5
## [1] 1457
## -------------------------------------------------------- 
## whitewine$quality: 6
## [1] 2198
## -------------------------------------------------------- 
## whitewine$quality: 7
## [1] 880
## -------------------------------------------------------- 
## whitewine$quality: 8
## [1] 175
## -------------------------------------------------------- 
## whitewine$quality: 9
## [1] 5
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   3.000   5.000   6.000   5.878   6.000   9.000

Description

Of the 4898 bottles of wine in the dataset, 2198 have been given a quality score of 6 and 1457 a quality score of 5, so the majority (75%) of the wine is mediocre. The median is equal to the third quartile, which also shows how heavily the data is concentrated around a score of 6. Overall the data is slightly positively skewed, and there are no observations that were given a score of 1, 2 or 10.

Since the quality of wine seems to involve a delicate balance of a number of different variables, not limited to those found in this dataset, my guess is that this distribution of scores is typical: it is rare to find wines that are exceptional, or exceptionally bad.
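The share of wines scoring 5 or 6 and the overall mean can be reproduced from the counts printed above. This is a short sketch using those counts re-entered by hand; the variable names are illustrative.

```r
# Number of wines per quality score, copied from the output above.
counts <- c("3" = 20, "4" = 163, "5" = 1457, "6" = 2198,
            "7" = 880, "8" = 175, "9" = 5)

total <- sum(counts)

# Percentage of wines at each score.
share <- round(100 * counts / total, 1)

# Combined percentage of wines scoring 5 or 6.
mid_share <- unname(100 * (counts["5"] + counts["6"]) / total)

# Mean quality as a count-weighted average of the scores.
mean_quality <- sum(as.numeric(names(counts)) * counts) / total

print(share)
print(round(mid_share, 1))
print(round(mean_quality, 3))
```

The counts sum to the 4898 observations in the dataset, and the weighted mean agrees with the 5.878 reported in the summary.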

Plot Two

## whitewine$quality: 3
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.9911  0.9925  0.9944  0.9949  0.9969  1.0000 
## -------------------------------------------------------- 
## whitewine$quality: 4
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.9892  0.9926  0.9941  0.9943  0.9958  1.0000 
## -------------------------------------------------------- 
## whitewine$quality: 5
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.9872  0.9933  0.9953  0.9953  0.9972  1.0020 
## -------------------------------------------------------- 
## whitewine$quality: 6
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.9876  0.9917  0.9937  0.9940  0.9959  1.0390 
## -------------------------------------------------------- 
## whitewine$quality: 7
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.9871  0.9906  0.9918  0.9925  0.9937  1.0000 
## -------------------------------------------------------- 
## whitewine$quality: 8
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.9871  0.9903  0.9916  0.9922  0.9935  1.0010 
## -------------------------------------------------------- 
## whitewine$quality: 9
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.9896  0.9898  0.9903  0.9915  0.9906  0.9970

Description

The mean density decreases as the quality of the wine increases. Overall, a large proportion of wine with a score of 7, 8 or 9 seems to be less dense than the wine with a lower score.

Although the general pattern is increasing quality for decreasing density, the exception to this rule is wine with a quality score of 5, where the mean density is actually higher than the mean density for wine with a quality score of 4.
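As a rough check of this pattern, the group means printed above can be tested for a monotonic trend with a Spearman rank correlation, and the scores that break the trend can be listed. The sketch works on the seven printed means only, not on the individual wines, so it is an illustration rather than a formal test.

```r
# Mean density per quality score, copied from the summaries above.
quality      <- 3:9
mean_density <- c(0.9949, 0.9943, 0.9953, 0.9940, 0.9925, 0.9922, 0.9915)

# Rank correlation between quality score and mean density
# (a value near -1 means density falls steadily as quality rises).
rho <- cor(quality, mean_density, method = "spearman")

# Quality scores whose mean density is higher than that of the score below them.
exceptions <- quality[-1][diff(mean_density) > 0]

print(round(rho, 3))
print(exceptions)
```

The only score flagged as an exception is 5, matching the observation above.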

The general pattern of decreasing density for increasing quality makes sense for white wine, since you would expect good wine to be light in density (although this may be my personal preference also!).

Plot Three

Description

The boxplots show summary data of the relationship between the pH/density/alcohol ratio and the quality score given to wine, and the scatterplot underneath shows the detail behind the summary plot. The plot shows a tendency towards a lower ratio resulting in a higher quality score. The ratio does seem to show a loose pattern; however, I believe looking at just three factors in the quality of wine may be oversimplifying the concept.

The summary data is as follows:

## whitewine$quality: 3
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.2604  0.2925  0.3066  0.3137  0.3194  0.4288 
## -------------------------------------------------------- 
## whitewine$quality: 4
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.2427  0.2903  0.3182  0.3183  0.3419  0.3897 
## -------------------------------------------------------- 
## whitewine$quality: 5
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.2294  0.3109  0.3302  0.3267  0.3456  0.4203 
## -------------------------------------------------------- 
## whitewine$quality: 6
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.2159  0.2800  0.3091  0.3067  0.3340  0.4097 
## -------------------------------------------------------- 
## whitewine$quality: 7
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.2174  0.2614  0.2837  0.2882  0.3121  0.3828 
## -------------------------------------------------------- 
## whitewine$quality: 8
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.2219  0.2570  0.2740  0.2821  0.3028  0.3812 
## -------------------------------------------------------- 
## whitewine$quality: 9
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.2609  0.2638  0.2649  0.2752  0.2779  0.3086

Reflection

The dataset contains 4898 different variants of white wine, examined using 11 input variables and a quality score. After researching each variable and its expected effects on, and significance to, the quality of wine, I began my analysis by looking at each variable in turn in order to get an overview of the spread of the data. I then used correlation figures as a starting point to look more closely at cross-variable patterns and eventually focussed on the pH/density/alcohol ratio, which seemed to show a relatively strong link to the quality score.

Whilst I did try to create a linear model, the data did not show a strong enough correlation between variables to create an adequate model. In many instances wine with a quality rating of 3 or 4 seemed to show similar levels of composition compared with the wines that scored 8 or 9, and wine with a quality rating of 5 or 6 had too high a variance across the input variables to display a discernible pattern.

Overall, as mentioned earlier, the plots suffer from overplotting, particularly with wine given a score of 5 or 6.

Researching the input variables showed that quality wine has a delicate balance of a number of variables, more than those covered by this dataset. The next steps would be to look at more variables, and look at interactions between variables in a higher level of detail.